K-Nearest Neighbors (KNN) — Classification & Regression (From Scratch)#
KNN is one of the most intuitive ML algorithms:
store the training data
for a new point, look up the K closest points
vote (classification) or average (regression)
It’s often described as lazy learning: there’s almost no “training” — the work happens at prediction time.
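The three steps above fit in a few lines. Here is a minimal sketch for classification (the function name `knn_predict` and the toy points are made up for illustration; the full implementation comes later):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # "Training" is just storing X_train / y_train.
    # Prediction: distances from the query to every stored point ...
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # ... indices of the k closest points ...
    nearest = np.argsort(dists)[:k]
    # ... and a majority vote among their labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.0]), k=3))  # -> 0
```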
What you’ll learn#
the core idea (with geometry + intuition)
the math for classification and regression
why feature scaling matters
how to implement KNN from scratch in NumPy
how to use sklearn KNN models and understand their key parameters
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
rng = np.random.default_rng(7)
1) Intuition: “ask your closest neighbors”#
Imagine you move into a new city and want to predict the rent for an apartment.
You might ask:
“What do similar apartments nearby cost?”
That’s KNN.
Similarity is measured by a distance (e.g., Euclidean distance).
The “answer” is based on the K nearest examples.
KNN is like a well-organized memory:
no complicated training
excellent for weird-shaped patterns
but can be slow at prediction time if the dataset is huge
2) The algorithm (math + notation)#
Let:
\(X \in \mathbb{R}^{n \times d}\) be the training features
\(y\) be the targets
\(x \in \mathbb{R}^d\) be a query point
2.1 Distances#
We compute a distance \(D(x, x_i)\) between the query point and every training point.
A common family is the Minkowski distance:
\(D_p(x, x_i) = \left( \sum_{j=1}^{d} |x_j - x_{ij}|^p \right)^{1/p}\)
\(p=2\) → Euclidean distance
\(p=1\) → Manhattan distance
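A quick numeric check on the classic 3-4-5 points (toy values chosen for illustration):

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# p=2: Euclidean distance, sqrt(3^2 + 4^2)
d2 = np.sum(np.abs(a - b) ** 2) ** 0.5
# p=1: Manhattan distance, |3| + |4|
d1 = np.sum(np.abs(a - b))

print(d2)  # 5.0
print(d1)  # 7.0
```

The same two points are "5 apart" or "7 apart" depending on the metric, which is exactly why the choice of \(p\) can change who your neighbors are.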
2.2 Classification#
Let \(\mathcal{N}_k(x)\) be the set of indices of the \(k\) nearest neighbors.
Uniform vote:
\(\hat{y} = \arg\max_{c} \sum_{i \in \mathcal{N}_k(x)} \mathbb{1}[y_i = c]\)
Distance-weighted vote (common):
\(\hat{y} = \arg\max_{c} \sum_{i \in \mathcal{N}_k(x)} w_i \, \mathbb{1}[y_i = c], \qquad w_i = \frac{1}{D(x, x_i) + \varepsilon}\)
2.3 Regression#
Uniform average:
\(\hat{y} = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i\)
Distance-weighted average:
\(\hat{y} = \frac{\sum_{i \in \mathcal{N}_k(x)} w_i \, y_i}{\sum_{i \in \mathcal{N}_k(x)} w_i}, \qquad w_i = \frac{1}{D(x, x_i) + \varepsilon}\)
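With the neighbor targets and distances in hand, the two averages are one line each (toy numbers chosen for illustration):

```python
import numpy as np

y_neighbors = np.array([2.0, 4.0, 10.0])  # targets of the k=3 nearest neighbors
d = np.array([0.5, 1.0, 4.0])             # their distances to the query
eps = 1e-9                                # guards against division by zero

w = 1.0 / (d + eps)                       # closer points get larger weights
y_hat_uniform = y_neighbors.mean()
y_hat_weighted = np.sum(w * y_neighbors) / np.sum(w)

print(round(y_hat_uniform, 3))   # 5.333
print(round(y_hat_weighted, 3))  # 3.231
```

The far-away neighbor with target 10.0 drags the uniform average up; distance weighting mostly ignores it.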
2.4 Big practical gotcha: scaling#
If one feature is measured in kilometers and another in millimeters, the kilometer feature dominates the distance.
So KNN almost always needs feature scaling (e.g. standardization).
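A toy illustration of the problem (points made up for illustration; the second feature is in much larger units):

```python
import numpy as np

# a and b are close in x1 and differ a bit in x2 (which lives on a ~1000x scale).
# c is far from a in x1 but identical in x2.
a = np.array([1.0, 5000.0])
b = np.array([1.2, 5300.0])
c = np.array([9.0, 5000.0])

print(np.linalg.norm(a - b))  # ~300.0 : x2 dominates the distance
print(np.linalg.norm(a - c))  # 8.0    : the "far" point looks much closer!
```

Without scaling, the x2 axis decides everything and c becomes a's nearest neighbor; after standardization, both features contribute comparably.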
3) Geometry: L1 vs L2 distance “balls”#
Distance isn’t just a formula — it’s a shape.
L2 distance (Euclidean) → circles (2D)
L1 distance (Manhattan) → diamonds (2D)
This changes which points count as “neighbors”.
# Visualize iso-distance curves around the origin
r = 1.0
# L2 circle
theta = np.linspace(0, 2*np.pi, 400)
x_circle = r * np.cos(theta)
y_circle = r * np.sin(theta)
# L1 diamond: |x| + |y| = r
x_diamond = np.array([0, r, 0, -r, 0])
y_diamond = np.array([r, 0, -r, 0, r])
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_circle, y=y_circle, mode="lines", name="L2: x²+y²=1"))
fig.add_trace(go.Scatter(x=x_diamond, y=y_diamond, mode="lines", name="L1: |x|+|y|=1"))
fig.update_layout(
title="Iso-distance curves in 2D",
xaxis_title="x",
yaxis_title="y",
width=700,
height=450,
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
4) A toy classification dataset (2D)#
We’ll use make_moons because it’s not linearly separable — perfect to show why KNN can be strong.
X, y = make_moons(n_samples=500, noise=0.25, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)
fig = px.scatter(
x=X_tr[:, 0],
y=X_tr[:, 1],
color=y_tr.astype(str),
title="Training set (make_moons)",
labels={"x": "x1", "y": "x2", "color": "class"},
)
fig.update_traces(marker=dict(size=7, opacity=0.8))
fig.update_layout(width=700, height=450)
fig.show()
5) KNN from scratch (NumPy)#
We’ll implement:
pairwise distances
“k nearest” lookup
classifier (uniform + distance-weighted)
regressor (uniform + distance-weighted)
We’ll keep it readable and educational.
def pairwise_minkowski_distances(X_query: np.ndarray, X_train: np.ndarray, p: float = 2.0) -> np.ndarray:
X_query = np.asarray(X_query, dtype=float)
X_train = np.asarray(X_train, dtype=float)
if X_query.ndim != 2 or X_train.ndim != 2:
raise ValueError("X_query and X_train must be 2D arrays")
if X_query.shape[1] != X_train.shape[1]:
raise ValueError("X_query and X_train must have the same number of features")
if p <= 0:
raise ValueError("p must be > 0")
if p == 2:
# (a-b)^2 = a^2 + b^2 - 2ab trick
q_sq = np.sum(X_query**2, axis=1, keepdims=True) # (m,1)
t_sq = np.sum(X_train**2, axis=1) # (n,)
d2 = q_sq + t_sq[None, :] - 2.0 * (X_query @ X_train.T)
d2 = np.maximum(d2, 0.0)
return np.sqrt(d2)
return np.sum(np.abs(X_query[:, None, :] - X_train[None, :, :]) ** p, axis=2) ** (1.0 / p)
def kneighbors_indices_and_distances(
X_train: np.ndarray,
X_query: np.ndarray,
k: int,
p: float = 2.0,
):
distances = pairwise_minkowski_distances(X_query, X_train, p=p) # (m,n)
if k <= 0:
raise ValueError("k must be >= 1")
if k > distances.shape[1]:
raise ValueError("k cannot exceed number of training samples")
idx = np.argpartition(distances, kth=k - 1, axis=1)[:, :k] # (m,k), unsorted
rows = np.arange(distances.shape[0])[:, None]
d_k = distances[rows, idx]
order = np.argsort(d_k, axis=1)
idx_sorted = idx[rows, order]
d_sorted = d_k[rows, order]
return idx_sorted, d_sorted
class ScratchKNNClassifier:
def __init__(self, n_neighbors: int = 5, p: float = 2.0, weights: str = "uniform", eps: float = 1e-9):
self.n_neighbors = int(n_neighbors)
self.p = float(p)
self.weights = weights
self.eps = float(eps)
def fit(self, X: np.ndarray, y: np.ndarray):
self.X_ = np.asarray(X, dtype=float)
self.y_ = np.asarray(y)
self.classes_, self.y_encoded_ = np.unique(self.y_, return_inverse=True)
return self
def _weights(self, distances: np.ndarray) -> np.ndarray:
if self.weights == "uniform":
return np.ones_like(distances, dtype=float)
if self.weights == "distance":
return 1.0 / (distances + self.eps)
raise ValueError("weights must be 'uniform' or 'distance'")
def predict_proba(self, X_query: np.ndarray) -> np.ndarray:
idx, d = kneighbors_indices_and_distances(self.X_, np.asarray(X_query, dtype=float), self.n_neighbors, p=self.p)
w = self._weights(d)
neighbor_labels = self.y_encoded_[idx]
n_query = idx.shape[0]
n_classes = self.classes_.shape[0]
proba = np.zeros((n_query, n_classes), dtype=float)
w_sum = np.sum(w, axis=1, keepdims=True)
for c in range(n_classes):
proba[:, c] = np.sum(w * (neighbor_labels == c), axis=1) / w_sum[:, 0]
return proba
def predict(self, X_query: np.ndarray) -> np.ndarray:
proba = self.predict_proba(X_query)
return self.classes_[np.argmax(proba, axis=1)]
class ScratchKNNRegressor:
def __init__(self, n_neighbors: int = 5, p: float = 2.0, weights: str = "uniform", eps: float = 1e-9):
self.n_neighbors = int(n_neighbors)
self.p = float(p)
self.weights = weights
self.eps = float(eps)
def fit(self, X: np.ndarray, y: np.ndarray):
self.X_ = np.asarray(X, dtype=float)
self.y_ = np.asarray(y, dtype=float)
return self
def _weights(self, distances: np.ndarray) -> np.ndarray:
if self.weights == "uniform":
return np.ones_like(distances, dtype=float)
if self.weights == "distance":
return 1.0 / (distances + self.eps)
raise ValueError("weights must be 'uniform' or 'distance'")
def predict(self, X_query: np.ndarray) -> np.ndarray:
idx, d = kneighbors_indices_and_distances(self.X_, np.asarray(X_query, dtype=float), self.n_neighbors, p=self.p)
w = self._weights(d)
y_neighbors = self.y_[idx]
return np.sum(w * y_neighbors, axis=1) / np.sum(w, axis=1)
5.1 One query point in slow motion#
Let’s pick a single test point and literally see which points become its neighbors.
scratch_knn = ScratchKNNClassifier(n_neighbors=10, p=2, weights="distance").fit(X_tr, y_tr)
xq = X_te[0:1]
yq = y_te[0]
idx, d = kneighbors_indices_and_distances(X_tr, xq, k=10, p=2)
idx = idx[0]
d = d[0]
print("Query true class:", yq)
print("Nearest neighbor distances:", np.round(d, 3))
print("Nearest neighbor labels:", y_tr[idx])
# Plot: training points, query point, and highlight neighbors
fig = go.Figure()
fig.add_trace(go.Scatter(x=X_tr[:, 0], y=X_tr[:, 1], mode="markers",
marker=dict(color=y_tr, colorscale="Viridis", size=7, opacity=0.6),
name="train"))
fig.add_trace(go.Scatter(x=xq[:, 0], y=xq[:, 1], mode="markers",
marker=dict(symbol="star", size=16, color="red"),
name="query"))
neighbors = X_tr[idx]
fig.add_trace(go.Scatter(x=neighbors[:, 0], y=neighbors[:, 1], mode="markers",
marker=dict(size=12, color="black", symbol="circle-open"),
name="neighbors"))
# draw thin lines from query to neighbors
for pt in neighbors:
fig.add_trace(go.Scatter(
x=[xq[0, 0], pt[0]],
y=[xq[0, 1], pt[1]],
mode="lines",
line=dict(color="rgba(0,0,0,0.15)", width=1),
showlegend=False,
))
fig.update_layout(title="A query point and its 10 nearest neighbors", width=750, height=500)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
Query true class: 1
Nearest neighbor distances: [0.028 0.121 0.128 0.173 0.18 0.201 0.205 0.208 0.211 0.22 ]
Nearest neighbor labels: [1 1 1 1 1 1 1 1 1 1]
5.2 Decision boundary (from scratch)#
To visualize KNN, we’ll compute predictions on a grid and color the plane.
def plot_knn_boundary_2d(model, X_train, y_train, title: str, grid_steps: int = 220):
x_min, x_max = X_train[:, 0].min() - 0.8, X_train[:, 0].max() + 0.8
y_min, y_max = X_train[:, 1].min() - 0.8, X_train[:, 1].max() + 0.8
xs = np.linspace(x_min, x_max, grid_steps)
ys = np.linspace(y_min, y_max, grid_steps)
xx, yy = np.meshgrid(xs, ys)
grid = np.c_[xx.ravel(), yy.ravel()]
if hasattr(model, "predict_proba"):
proba = model.predict_proba(grid)
z = proba[:, 1].reshape(xx.shape)
z_title = "P(class=1)"
else:
z = model.predict(grid).reshape(xx.shape)
z_title = "prediction"
fig = go.Figure()
fig.add_trace(go.Contour(
x=xs,
y=ys,
z=z,
colorscale="RdBu",
opacity=0.75,
contours=dict(showlines=False),
colorbar=dict(title=z_title),
))
fig.add_trace(go.Scatter(
x=X_train[:, 0],
y=X_train[:, 1],
mode="markers",
marker=dict(color=y_train, colorscale="Viridis", size=6, line=dict(width=0.5, color="white")),
name="train",
))
fig.update_layout(title=title, width=750, height=520)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
return fig
for k in [1, 5, 25]:
m = ScratchKNNClassifier(n_neighbors=k, p=2, weights="uniform").fit(X_tr, y_tr)
fig = plot_knn_boundary_2d(m, X_tr, y_tr, title=f"Scratch KNN boundary (k={k}, uniform)")
fig.show()
5.3 Bias–variance intuition: small k vs large k#
K is like a zoom level:
k=1: super local, very wiggly boundary → low bias, high variance
large k: very smooth boundary → higher bias, lower variance
A useful mental model:
“How many neighbors should I ask for advice?”
If you ask one person, you might get an extreme opinion. If you ask a hundred, you get a stable average — but you might miss niche local context.
def accuracy_for_k(k: int, weights: str = "uniform") -> float:
model = ScratchKNNClassifier(n_neighbors=k, p=2, weights=weights).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
return float(accuracy_score(y_te, y_pred))
ks = list(range(1, 51, 2))
acc_uniform = [accuracy_for_k(k, weights="uniform") for k in ks]
acc_distance = [accuracy_for_k(k, weights="distance") for k in ks]
fig = go.Figure()
fig.add_trace(go.Scatter(x=ks, y=acc_uniform, mode="lines+markers", name="uniform"))
fig.add_trace(go.Scatter(x=ks, y=acc_distance, mode="lines+markers", name="distance"))
fig.update_layout(
title="Scratch KNN test accuracy vs k",
xaxis_title="k",
yaxis_title="accuracy",
width=750,
height=450,
)
fig.show()
best_k = ks[int(np.argmax(acc_distance))]
print("Best k (distance weights) in this sweep:", best_k)
Best k (distance weights) in this sweep: 25
6) KNN regression (1D example)#
Regression KNN is like:
“What’s the average target value of the nearest examples?”
With uniform weights it produces a piecewise-constant (step-like) curve; distance weighting smooths the steps somewhat.
# 1D regression dataset
n_reg = 120
x_reg = np.linspace(-3, 3, n_reg)
noise_reg = rng.normal(0, 0.25, size=n_reg)
y_reg = np.sin(x_reg) + 0.2 * x_reg + noise_reg
X1 = x_reg.reshape(-1, 1)
# Train/test split
perm = rng.permutation(n_reg)
train_idx, test_idx = perm[:90], perm[90:]
X1_tr, y_tr_reg = X1[train_idx], y_reg[train_idx]
X1_te, y_te_reg = X1[test_idx], y_reg[test_idx]
x_grid = np.linspace(x_reg.min() - 0.2, x_reg.max() + 0.2, 500).reshape(-1, 1)
fig = px.scatter(x=X1_tr[:, 0], y=y_tr_reg, title="1D regression training data", labels={"x": "x", "y": "y"})
fig.update_traces(marker=dict(size=7, opacity=0.8))
fig.update_layout(width=750, height=450)
fig.show()
def rmse(y_true, y_pred):
return float(np.sqrt(mean_squared_error(y_true, y_pred)))
fig = go.Figure()
fig.add_trace(go.Scatter(x=X1_tr[:, 0], y=y_tr_reg, mode="markers", name="train", marker=dict(size=7, opacity=0.7)))
for k in [1, 5, 15, 35]:
model = ScratchKNNRegressor(n_neighbors=k, p=2, weights="uniform").fit(X1_tr, y_tr_reg)
y_grid = model.predict(x_grid)
y_pred_te = model.predict(X1_te)
fig.add_trace(go.Scatter(x=x_grid[:, 0], y=y_grid, mode="lines", name=f"k={k} (RMSE={rmse(y_te_reg, y_pred_te):.3f})"))
fig.update_layout(title="Scratch KNN regression: effect of k", xaxis_title="x", yaxis_title="y", width=900, height=500)
fig.show()
6.1 Uniform vs distance weights (regression)#
With distance weights, closer points count more.
helps when the “right answer” is very local
can be noisier when points are extremely close (hence the small \(\varepsilon\))
fig = go.Figure()
fig.add_trace(go.Scatter(x=X1_tr[:, 0], y=y_tr_reg, mode="markers", name="train", marker=dict(size=7, opacity=0.7)))
k = 15
m_uniform = ScratchKNNRegressor(n_neighbors=k, p=2, weights="uniform").fit(X1_tr, y_tr_reg)
m_distance = ScratchKNNRegressor(n_neighbors=k, p=2, weights="distance").fit(X1_tr, y_tr_reg)
fig.add_trace(go.Scatter(x=x_grid[:, 0], y=m_uniform.predict(x_grid), mode="lines", name="uniform"))
fig.add_trace(go.Scatter(x=x_grid[:, 0], y=m_distance.predict(x_grid), mode="lines", name="distance"))
fig.update_layout(title=f"Scratch KNN regression: weights (k={k})", xaxis_title="x", yaxis_title="y", width=900, height=500)
fig.show()
7) Why scaling matters (a dramatic demo)#
We’ll make a dataset with two features:
\(x_1\) in range ~[-2, 2]
\(x_2\) in range ~[0, 2000]
If we don’t scale, \(x_2\) dominates the distance — and KNN behaves as if \(x_1\) barely exists.
# Make a scaled-up second feature that should NOT dominate the decision
X_bad = X.copy()
X_bad[:, 1] = (X_bad[:, 1] - X_bad[:, 1].min()) / (np.ptp(X_bad[:, 1]) + 1e-9) # normalize to [0,1]
X_bad[:, 1] = X_bad[:, 1] * 2000 # huge scale
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_bad, y, test_size=0.3, random_state=7, stratify=y)
# Scratch KNN without scaling
m_raw = ScratchKNNClassifier(n_neighbors=15, weights="uniform").fit(Xb_tr, yb_tr)
acc_raw = accuracy_score(yb_te, m_raw.predict(Xb_te))
# sklearn Pipeline with scaling
pipe = Pipeline([
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=15, weights="uniform")),
])
pipe.fit(Xb_tr, yb_tr)
acc_scaled = accuracy_score(yb_te, pipe.predict(Xb_te))
print(f"Accuracy without scaling: {acc_raw:.3f}")
print(f"Accuracy with scaling : {acc_scaled:.3f}")
Accuracy without scaling: 0.880
Accuracy with scaling : 0.973
# Plot boundaries (note: x2 scale is huge)
fig1 = plot_knn_boundary_2d(m_raw, Xb_tr, yb_tr, title="Scratch KNN boundary (NO scaling)")
fig1.show()
# Boundary after scaling using sklearn model: we need a wrapper with predict_proba
class _SklearnWrapper:
def __init__(self, model):
self.model = model
def predict_proba(self, Xq):
return self.model.predict_proba(Xq)
fig2 = plot_knn_boundary_2d(_SklearnWrapper(pipe), Xb_tr, yb_tr, title="sklearn KNN boundary (StandardScaler + KNN)")
fig2.show()
8) Curse of dimensionality (why KNN can struggle in high dimensions)#
In high dimensions, points tend to become “equally far apart”.
A classic demo:
sample points uniformly in \([0,1]^d\)
look at the ratio: (nearest distance) / (farthest distance)
As dimension \(d\) grows, the ratio approaches 1.
That means:
the nearest neighbor isn’t that much nearer than a random point
So KNN can lose its “locality” advantage unless you reduce dimension or have a huge dataset.
def nn_distance_ratio(dim: int, n_points: int = 1500, seed: int = 7) -> float:
r = np.random.default_rng(seed + dim)
Xd = r.random((n_points, dim))
xq = r.random((1, dim))
dists = pairwise_minkowski_distances(xq, Xd, p=2)[0]
return float(dists.min() / dists.max())
dims = list(range(1, 51, 2))
ratios = [nn_distance_ratio(d) for d in dims]
fig = go.Figure()
fig.add_trace(go.Scatter(x=dims, y=ratios, mode="lines+markers"))
fig.update_layout(
title="Nearest/farthest distance ratio vs dimension",
xaxis_title="dimension d",
yaxis_title="min(dist)/max(dist)",
width=750,
height=450,
)
fig.show()
9) Practical KNN with scikit-learn#
sklearn provides optimized KNN implementations with a clean interface.
9.1 Key parameters (Classifier / Regressor)#
n_neighbors: K
weights: 'uniform' or 'distance' (or a custom function)
p: Minkowski distance power (p=2 → Euclidean, p=1 → Manhattan)
metric: distance metric (default 'minkowski')
algorithm: 'auto', 'ball_tree', 'kd_tree', 'brute'
leaf_size: affects tree-based neighbor search performance
n_jobs: parallelism (when supported)
A good default starting point:
scale features with StandardScaler
try a small sweep over n_neighbors
compare weights='uniform' vs weights='distance'
# A simple sklearn baseline
sk_knn = Pipeline([
("scaler", StandardScaler()),
("knn", KNeighborsClassifier(n_neighbors=15, weights="distance", p=2)),
])
sk_knn.fit(X_tr, y_tr)
y_pred = sk_knn.predict(X_te)
print("Test accuracy:", accuracy_score(y_te, y_pred))
Test accuracy: 0.9466666666666667
9.2 A small grid search (select K + weights)#
This is the “grown-up” version of our manual accuracy sweep.
param_grid = {
"knn__n_neighbors": list(range(1, 51, 2)),
"knn__weights": ["uniform", "distance"],
"knn__p": [1, 2],
}
pipe = Pipeline([
("scaler", StandardScaler()),
("knn", KNeighborsClassifier()),
])
gs = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
gs.fit(X_tr, y_tr)
print("Best params:", gs.best_params_)
print("Best CV score:", gs.best_score_)
print("Test accuracy:", accuracy_score(y_te, gs.best_estimator_.predict(X_te)))
Best params: {'knn__n_neighbors': 7, 'knn__p': 1, 'knn__weights': 'uniform'}
Best CV score: 0.9314285714285715
Test accuracy: 0.9533333333333334
10) Summary: when to use KNN#
Great when:
you want a strong baseline with minimal assumptions
the decision boundary is complex and you have enough data
interpretability as “similar examples” is valuable
Be careful when:
features have different scales (always scale)
you’re in very high dimensions (consider PCA / feature selection)
you need fast predictions at huge scale (consider approximate nearest neighbors or different models)
KNN is like a good memory + a sensible notion of similarity.
Exercises#
Implement a custom distance that weighs features differently.
Add tie-breaking rules explicitly for classification.
Compare p=1 vs p=2 on make_moons.
For regression, compare KNN to linear regression on the same dataset.
References#
scikit-learn docs: KNeighborsClassifier / KNeighborsRegressor
“The Elements of Statistical Learning” — KNN discussion